Data Management

StartR Workshop

Maik Bieleke, PhD

University of Konstanz

November 24, 2024

Merging

What is merging?

Merging is the process of combining two (or more) data sets into one. Merging requires that the data sets have at least one variable in common, usually an ID variable.

There are four types of merging:

  • full join: keep all observations from both data sets
  • left join: keep all observations from the left data set
  • right join: keep all observations from the right data set
  • Inner join: only keep observations that appear in both data sets

We use functinos from the dplyr package.

install.packages(dplyr) # if not installed
library(dplyr)

Types of merging

Examples

We create two data frames with the shared key variable id. They have observations from different participants and different variables.

# Create two data frames
dfA <- data.frame(id = c(1, 2, 3), w = c(10, 8, 15), x = c(7, 9, 12))
dfB <- data.frame(id = c(1, 2, 4), y = c(2, 10, 11), z = c(5, 5, 3))

Now we can use functions from the dplyr package to merge the data frames.

dplyr::full_join(dfA, dfB)
  id  w  x  y  z
1  1 10  7  2  5
2  2  8  9 10  5
3  3 15 12 NA NA
4  4 NA NA 11  3
dplyr::inner_join(dfA, dfB)
  id  w x  y z
1  1 10 7  2 5
2  2  8 9 10 5
dplyr::left_join(dfA, dfB)
  id  w  x  y  z
1  1 10  7  2  5
2  2  8  9 10  5
3  3 15 12 NA NA
dplyr::right_join(dfA, dfB)
  id  w  x  y z
1  1 10  7  2 5
2  2  8  9 10 5
3  4 NA NA 11 3

Common problems

  • different names of the key variable

    dfA <- data.frame(id = c(1, 2, 3), w = c(10, 8, 15), x = c(7, 9, 12))
    dfB <- data.frame(IDENT = c(1, 2, 4), z = c(5, 5, 3))
    dplyr::full_join(dfA, dfB, join_by(id == IDENT))
      id  w  x  z
    1  1 10  7  5
    2  2  8  9  5
    3  3 15 12 NA
    4  4 NA NA  3
  • multiple key variables

    dfA <- data.frame(id = c(1, 1, 2, 2), wave = c(1, 2, 1, 2),
                      anx = c(10, 8, 15, 16), dep = c(7, 9, 12, 11))
    dfB <- data.frame(id = c(1, 1, 3, 3), wave = c(1, 2, 1, 2),
                      ang = c(2, 4, 11, 11), dis = c(5, 5, 3, 5))
    
    dplyr::full_join(dfA, dfB, by = c("id", "wave"))
      id wave anx dep ang dis
    1  1    1  10   7   2   5
    2  1    2   8   9   4   5
    3  2    1  15  12  NA  NA
    4  2    2  16  11  NA  NA
    5  3    1  NA  NA  11   3
    6  3    2  NA  NA  11   5

Reshaping

What is reshaping?

Reshaping is the process of transforming data without changing the data itself.

There are two types of reshaping:

  • wide to long: transform data from a wide format to a long format
  • long to wide: transform data from a long format to a wide format

We use functions from the tidyr package.

install.packages(tidyr) # if not installed
library(tidyr)

Types of reshaping

Wide-to-long

Wide

We start with a data frame in a wide format.

# Define a wide data frame
dfr <- data.frame(
  id = c(1, 2), 
  a_1 = c(10, 8), 
  a_2 = c(7, 9),
  b_1 = c(2, 11),
  b_2 = c(5, 5))
dfr
  id a_1 a_2 b_1 b_2
1  1  10   7   2   5
2  2   8   9  11   5

Long

Transform data frame to long format with pivot_longer().

# from wide to long
dfr_long <- pivot_longer(
  dfr, cols = c(a_1, a_2, b_1, b_2),
  names_to = "variable", 
  values_to = "value")
dfr_long
# A tibble: 8 × 3
     id variable value
  <dbl> <chr>    <dbl>
1     1 a_1         10
2     1 a_2          7
3     1 b_1          2
4     1 b_2          5
5     2 a_1          8
6     2 a_2          9
7     2 b_1         11
8     2 b_2          5

Long-to-wide

Long

This time, we start with a data frame in a wide format.

# Define a long data frame
dfr <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  variable = c("a_1", "a_2", "b_1", "b_2",
               "a_1", "a_2", "b_1", "b_2"),
  value = c(10, 7, 2, 5, 8, 9, 11, 5))
dfr
  id variable value
1  1      a_1    10
2  1      a_2     7
3  1      b_1     2
4  1      b_2     5
5  2      a_1     8
6  2      a_2     9
7  2      b_1    11
8  2      b_2     5

Wide

Transform data frame to wide with pivot_wider():

dfr_wide <- pivot_wider(
  dfr,
  names_from = "variable", 
  values_from = "value")
dfr_wide
# A tibble: 2 × 5
     id   a_1   a_2   b_1   b_2
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10     7     2     5
2     2     8     9    11     5

Hidden identifiers

Often variable names in the wide format contain more than one piece of information. For example, the variable a_1 contains information about the variable a and the time point 1. This is called a hidden identifier.

Wide

# Define a wide data frame
dfr <- data.frame(
  id = c(1, 2), 
  a_1 = c(10, 8), 
  a_2 = c(7, 9),
  b_1 = c(2, 11),
  b_2 = c(5, 5))
dfr
  id a_1 a_2 b_1 b_2
1  1  10   7   2   5
2  2   8   9  11   5

Long

# from wide to long
dfr_long <- pivot_longer(
  dfr, cols = c(a_1, a_2, b_1, b_2),
  names_to = c("variable", "time"), 
  names_sep = "_",
  values_to = "value")
dfr_long
# A tibble: 8 × 4
     id variable time  value
  <dbl> <chr>    <chr> <dbl>
1     1 a        1        10
2     1 a        2         7
3     1 b        1         2
4     1 b        2         5
5     2 a        1         8
6     2 a        2         9
7     2 b        1        11
8     2 b        2         5

Aggregation

What is aggregation?

Aggregation is the process of combining multiple observations into a single observation.

There are two types of aggregation:

  • column-wise: combine observations from individual columns
  • row-wise: combine observations from multiple columns

We use functinos from the dplyr package.

Colum-wise aggregation

# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8),  b_2 = c(5, 5, 3))

Compute various colum-wise statistics:

# Compute mean of a single variable
summarize(dfr, MEAN = mean(a_1))
      MEAN
1 7.666667
# Compute mean and standard deviation of a single variable
summarize(dfr, MEAN = mean(a_1), SD = sd(a_1))
      MEAN       SD
1 7.666667 2.516611
# Compute mean of multiple variables
summarize(dfr, across(c(a_1, b_2), mean))
       a_1      b_2
1 7.666667 4.333333
# Compute mean and standard deviation of multiple variables
summarize(dfr, across(c(a_1, b_2), list(MEAN = ~mean(.x), SD = ~sd(.x))))
  a_1_MEAN   a_1_SD b_2_MEAN   b_2_SD
1 7.666667 2.516611 4.333333 1.154701

Row-wise aggregation

# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8),  b_2 = c(5, 5, 3))

Compute various row-wise statistics:

# Compute mean of multiple variables
summarize(rowwise(dfr), MEAN = mean(c(a_1, a_2)))
# A tibble: 3 × 1
   MEAN
  <dbl>
1   8.5
2   8.5
3   3.5
# Compute mean and standard deviationof multiple variables
summarize(rowwise(dfr), MEAN = mean(c(a_1, a_2)), SD = sd(c(a_1, a_2)))
# A tibble: 3 × 2
   MEAN    SD
  <dbl> <dbl>
1   8.5 2.12 
2   8.5 0.707
3   3.5 2.12 

Groupwise aggregation

We can aggregate for different groups separately using group_by().

# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3, 4), condition = c("A", "A", "B", "B"),
                  a_1 = c(10, 8, 5, 7), a_2 = c(7, 9, 2, 5),
                  b_1 = c(2, 11, 8, 4),  b_2 = c(5, 5, 3, 1))
dfr
  id condition a_1 a_2 b_1 b_2
1  1         A  10   7   2   5
2  2         A   8   9  11   5
3  3         B   5   2   8   3
4  4         B   7   5   4   1

Aggregate for conditions A and B:

# Compute mean of a single variable for each condition
summarize(group_by(dfr, condition), MEAN = mean(a_1))
# A tibble: 2 × 2
  condition  MEAN
  <chr>     <dbl>
1 A             9
2 B             6

Loops and conditionals

What is a loop?

A loop is used to repeat a sequence of commands multiple times, each time using a different value of a loop index.

A loop consists of

  • a loop index (e.g., i) that takes on different values
  • a loop vector (e.g., 1:3) with all values that the loop index should take
  • a loop expression (e.g., print(i)) with the commands to be executed for each value of the loop index

These elements are combined in a loop statement:

for(index in vector){expression}

for(i in 1:3){print(i)}
[1] 1
[1] 2
[1] 3

Example

Let’s use the example from above, where we computed means for different columns of a data frame.

# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8),  b_2 = c(5, 5, 3))

for(column in c("a_1", "a_2", "b_1", "b_2")){
  var <- dfr[, column]
  MEAN <- mean(var)
  print(MEAN)
}
[1] 7.666667
[1] 6
[1] 7
[1] 4.333333

Loops are rarely efficient, but especially beginners often use them because they are intuitive.

What is a conditional?

A conditional is used to execute commands only if a certain condition is met. They typically consist of if and else statements.

if(1 > 0){
  print("1 is larger than 0")
} else {
  print("1 is not larger than 0")
}
[1] "1 is larger than 0"

A frequently used special form is the ifelse() statement, which can be used to replace the values of a vector depending on a condition.

 [1] "smaller" "smaller" "smaller" "smaller" "smaller" "larger"  "larger" 
 [8] "larger"  "larger"  "larger"